Abstract

Motivated by the problem of distributed geographical packet forwarding in a wireless sensor network with sleep- 
wake cycling nodes, we propose a local forwarding model comprising a node that wishes to forward a packet towards 
a destination, and a set of next-hop relay nodes, each of which is associated with a "reward" that summarises the 
cost/benefit of forwarding the packet through that relay. The relays wake up at random times, at which instants they 
reveal only the probability distributions of their rewards (e.g., by revealing their locations). To determine a relay's 
exact reward, the forwarding node has to further probe the relay, incurring a probing cost. Thus, at each relay wake-up 
instant, the source, given a set of relay reward distributions, has to decide whether to stop (and forward the packet to 
an already probed relay), continue waiting for further relays to wake-up, or probe an unprobed relay. We formulate 
the problem as a Markov decision process, with the objective being to minimize the packet forwarding delay subject 
to a constraint on the effective reward (the difference between the actual reward of the chosen relay and the total
probing cost). Our problem can be considered as a variant of the asset selling problem with partial revelation of
offers.

The most general class of decision policies can keep awake any or all the relays that have woken up. In this 
paper, we study the optimum over a restricted class of policies which, at any time, can keep only one unprobed relay 
awake, in addition to the best among the probed relays. We prove that the optimum stopping policy over this class 
is of threshold type, where the same threshold is used at each relay wake-up instant. Numerically, we find that the 
performance of the optimum over the restricted class is very close to that over the unrestricted class. 
I. Introduction

Consider a wireless sensor network deployed for the detection of a rare event [1], [2], [3], e.g., forest fire, intrusion
in border areas, etc. To conserve energy, the nodes in the network sleep-wake cycle whereby they alternate between 
an ON state and a low power OFF state. We are further interested in asynchronous sleep-wake where the point 
processes of wake-up instants of the nodes are mutually independent [1], [4].

In such networks, whenever an event is detected, an alarm packet (containing the event location and a time stamp) 
is generated and has to be forwarded, through multiple hops, to a control center (sink) where appropriate action 
could be taken. In the present work we consider the traffic model in which one alarm packet is generated for each 
detected event. One may argue that in the case of an emergency it may be better to simply broadcast the alarm 
packet. A naive uncontrolled broadcast can lead to a broadcast storm [5]. There have been proposals for forwarding
packets via controlled broadcast, e.g., Gradient Broadcast (GRAB) [6]. It is, however, difficult to optimize the
tradeoffs in these protocols, such as forwarding delay and network power consumption. 

Instead, in our work we seek to optimize the forwarding of a single alarm packet, in the face of sleep-wake 
cycling. Natural extensions of this work would be to consider the controlled forwarding of multiple copies of the 
alarm. Also, instead of simple multihopping we could use cooperative forwarding techniques. We are considering 
these aspects in our ongoing work. 

Now, since the network is sleep-wake cycling, a forwarding node (i.e., a node with an alarm packet) has to wait 
for its neighbors to wake-up before it can choose a neighbor as the next hop. Thus, there is a delay incurred, 
due to the sleep-wake process, at each hop enroute to the sink. We are interested in minimizing the total average 
delay subject to a constraint on some global metric of interest such as average hop counts, or the average total 
transmission power (sum of the transmission power used at each hop). Such a global problem can be considered 
as a stochastic shortest path problem [7], for which a distributed Bellman-Ford algorithm (e.g., the LOCAL-OPT 
algorithm proposed by Kim et al. in [1]) can be used to obtain the optimal solution. A major drawback with such
an approach is that a pre-configuration phase is required to run such algorithms, which would involve exchange of 
several control messages. 

The focus of our research is, instead, towards designing simple forwarding rules using only the local information 
available at a forwarding node. For instance, in our earlier work ([4], [8]) on the problem of minimizing total
end-to-end delay subject to a hop-count constraint, we formulated a local forwarding problem of minimizing one 
hop delay subject to a constraint on the progress (towards the sink) made by the chosen relay. We showed that the 
tradeoff, between the end-to-end metrics (total delay and hop counts), obtained by applying the local solution at 
each hop enroute to the sink is comparable with that obtained by the distributed Bellman-Ford algorithm. However, 
in our earlier model we had not considered the effect of random channel gains on the relay selection process. Thus, 
the current work is an extension of our local forwarding problem where now, due to the channel gains, new features 
such as, the action to probe a relay (to learn its channel) and the associated energy cost (probing cost), have to be 
incorporated while choosing a next-hop relay. 

Outline and Contributions: In Section [II] we formulate the local forwarding problem of choosing a relay for the 
next-hop in the presence of random sleep-wake cycling and unknown channel gains. Our problem is a variant of the 
optimal stopping problem where we have a probe action, in addition to the traditional stop and continue actions. 
The main technical contributions are: 

• We characterize (in Section III) the optimal policy over a restricted class (referred to as RST-OPT) in terms
of certain stopping and stopping/probing sets (Section IV).

• We prove that the stopping sets have a threshold structure and, further, that the thresholds are stage independent. This
can be considered as a generalization of the one-step-look ahead rule (see the Remark following Theorem 2).

• Finally, through numerical work (Section VII) we find that the performance of RST-OPT is close to that of the
global optimal (GLB-OPT).

For the ease of presentation, we have moved most of the proofs to the Appendix. 



II. The Local Forwarding Problem: System Model 

In this section we will develop our system model from the context of geographical forwarding |@), (9), iflOl . By 
geographical forwarding, also known as location aware routing, we mean that each node in the network knows its 
location (with respect to some reference) and the location of the sink. 

Consider a forwarding node (henceforth referred to as the source) at a distance $v_0$ from the sink (see Fig. 1). The
communication region is the set of all locations where reliable exchange of control messages can take place between
the source and a receiver, if any, at these locations. In Fig. 1 we have shown the communication region to be circular,
but in reality this region can be arbitrary. The nodes within the communication region are referred to as
neighbors. Let $v_\ell$ represent the distance of a location $\ell$ (a location is a point in $\mathbb{R}^2$) from the sink. Then define the
progress of the location $\ell$ as $Z_\ell := v_0 - v_\ell$. The source is interested in forwarding the packet only to a neighbor
within the forwarding region $\mathcal{L}$, where $\mathcal{L} := \{\ell \in \text{communication region} : Z_\ell > 0\}$. The forwarding region is shown
hatched in Fig. 1. We will refer to the nodes in the forwarding region as relays.
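
As a minimal sketch of these definitions, the following Python snippet computes the progress of a location and tests whether it lies in the forwarding region. The function names, the coordinate tuples, and the circular communication region of radius `comm_radius` are assumptions made purely for illustration; as noted, the actual communication region can be arbitrary.

```python
import math

def progress(source, sink, loc):
    """Progress of `loc`: source-sink distance minus location-sink distance."""
    return math.dist(source, sink) - math.dist(loc, sink)

def in_forwarding_region(source, sink, loc, comm_radius):
    """True if `loc` is within range of the source and makes positive progress.

    Assumption: the communication region is a disc of radius `comm_radius`
    centred at the source (hypothetical choice for this sketch).
    """
    within_range = math.dist(source, loc) <= comm_radius
    return within_range and progress(source, sink, loc) > 0

# Source at the origin, sink 10 units to the right.
print(progress((0.0, 0.0), (10.0, 0.0), (1.0, 0.0)))
print(in_forwarding_region((0.0, 0.0), (10.0, 0.0), (-1.0, 0.0), 2.0))
```

A location one unit closer to the sink has positive progress and is a candidate relay, whereas a location on the far side of the source is excluded even though it is within range.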




Fig. 1. The hatched area is the forwarding region $\mathcal{L}$. For a location $\ell \in \mathcal{L}$, the progress $Z_\ell$ is the difference between the source-sink distance,
$v_0$, and the location-sink distance, $v_\ell$.

For the channel between the source and a relay at location $\ell$ (if any) we will assume the Rayleigh fading model,
so that the received symbol $Y_\ell$ at $\ell$ can be written as

$$Y_\ell = \sqrt{P d_\ell^{-\beta}}\, H_\ell X + W_\ell,$$

where $P$ is the transmit power, $d_\ell$ is the distance between the source and the location $\ell$, $\beta$ is the path loss attenuation
factor (usually in the range 2 to 4), $H_\ell$ is the random channel gain ($H_\ell \sim \mathcal{C}(0, 1)$), $X$ is the unit-power transmitted
symbol (i.e., $\mathbb{E}|X|^2 = 1$), and $W_\ell$ is complex Gaussian noise of variance $N_0$ ($W_\ell \sim \mathcal{C}(0, N_0)$). By $H_\ell \sim \mathcal{C}(0, 1)$, we
mean that $H_\ell$ is a complex Gaussian random variable of zero mean and unit variance.

We will assume that the set of channel gains, $\{H_\ell : \ell \in \mathcal{L}\}$, is homogeneous, i.e., $\{H_\ell\}$ are independent. We will
further assume a quasi-static channel model where, each time the source gets a packet to forward, the channel gains
$\{H_\ell\}$ remain unchanged over the entire duration of the decision process. However, these would have changed the
next time the source gets a packet to forward. In physical layer wireless terminology, we have a flat fading, slowly
varying channel.

The instantaneous received SNR (signal to noise ratio) at $\ell$ is given by $\mathrm{SNR}(P, d_\ell) = \frac{P d_\ell^{-\beta} |H_\ell|^2}{N_0}$. A transmission
is considered to be successful (i.e., a receiver at $\ell$ can decode reliably) if the received $\mathrm{SNR}(P, d_\ell)$ is above a
threshold $\Gamma$. Then, given the channel gain $H_\ell$, the minimum power required to achieve the SNR constraint is

$$P_\ell = \frac{\Gamma N_0\, d_\ell^{\beta}}{|H_\ell|^2}.$$

Reward Structure: Several metrics have been proposed in the literature [11] to enable a forwarding node to
evaluate its neighbors before choosing one for the next hop. In the current work we use a reward metric
that combines the progress ($Z_\ell$) and the minimum power required ($P_\ell$). Formally, to each location
$\ell \in \mathcal{L}$ we associate a reward $R_\ell$ as

$$R_\ell = \frac{Z_\ell^{a}}{P_\ell^{(1-a)}} = \frac{Z_\ell^{a}}{(\Gamma N_0)^{(1-a)}} \left( \frac{|H_\ell|^2}{d_\ell^{\beta}} \right)^{(1-a)}, \qquad (1)$$

where $a \in [0, 1]$ is used to trade off between $Z_\ell$ and $P_\ell$. The reward decreases with $P_\ell$
because it is advantageous to use less power to get the packet across. $R_\ell$ increases with $Z_\ell$ to give a
sense of progress towards the sink while choosing a relay for the next hop. Further motivation for choosing this
particular structure for the reward is available in Appendix VIII-A. Finally, let $F_\ell$ represent the distribution of $R_\ell$.
Thus, there is a collection of reward distributions $\mathcal{F}$ indexed by $\ell \in \mathcal{L}$. From (1), note that, given a location $\ell$, it
is only possible to know the reward distribution $F_\ell$. To know the exact reward $R_\ell$, the source has to send probe
packets and learn the channel gain $H_\ell$.
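
To make the distinction between a reward distribution and a probed reward concrete, here is a hedged Python sketch that samples rewards for a fixed location. It assumes the reward has the form $Z_\ell^{a} / P_\ell^{(1-a)}$ with $P_\ell$ the minimum power for the SNR threshold, and it uses the fact that the squared magnitude of a unit-variance complex Gaussian gain is exponentially distributed with unit mean. The function name, the default parameter values, and the single parameter `gamma_n0` (standing for the product of the SNR threshold and the noise variance) are illustrative assumptions.

```python
import numpy as np

def sample_rewards(z, d, a=0.5, beta=2.0, gamma_n0=1.0, size=10000, rng=None):
    """Draw samples of the reward at a location with progress z and distance d.

    Assumed reward form: z**a / p_min**(1 - a), where p_min is the minimum
    transmit power meeting the SNR threshold under Rayleigh fading.
    """
    rng = np.random.default_rng() if rng is None else rng
    gain = rng.exponential(1.0, size)      # |H|^2 for H ~ C(0, 1)
    p_min = gamma_n0 * d**beta / gain      # minimum power per channel draw
    return z**a / p_min**(1.0 - a)

# Knowing only the location gives the distribution (here, its empirical mean);
# probing would reveal one particular draw.
samples = sample_rewards(z=2.0, d=1.5, rng=np.random.default_rng(0))
print(samples.mean(), samples[0])
```

With `a=1.0` the reward reduces to the progress alone and no longer depends on the channel, which corresponds to the setting where the reward is exactly revealed upon wake-up.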

Sleep-Wake Process: Without loss of generality, we will assume that the source gets a packet to forward (either
from an upstream node or by detecting an event) at time 0. There are $N$ relays in the forwarding set $\mathcal{L}$ that wake up
sequentially at the points of a renewal process, $W_1 < W_2 < \cdots < W_N$. Let $U_k := W_k - W_{k-1}$ (with $U_1 := W_1$)
denote the inter-wake-up time (renewal lifetime) between the $(k-1)$-th and the $k$-th relay. Then $U_1, U_2, \ldots, U_N$ are
independent, each with mean $\tau$. For example, $U_k$, $1 \le k \le N$, could be exponentially distributed with
mean $\tau$, or could be constant (deterministic) with value $\tau$.
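
The two example wake-up processes can be sketched in a few lines of Python. The function name and the `kind` argument are hypothetical; the snippet only illustrates that the wake-up instants are cumulative sums of inter-wake-up times with the stated mean.

```python
import numpy as np

def wake_up_times(n_relays, tau, kind="exponential", rng=None):
    """Return the wake-up instants W_1 <= ... <= W_N as cumulative sums of U_k."""
    rng = np.random.default_rng() if rng is None else rng
    if kind == "exponential":
        u = rng.exponential(tau, n_relays)    # inter-wake-up times, mean tau
    elif kind == "constant":
        u = np.full(n_relays, float(tau))     # deterministic spacing tau
    else:
        raise ValueError("kind must be 'exponential' or 'constant'")
    return np.cumsum(u)

print(wake_up_times(3, 2.0, kind="constant"))
print(wake_up_times(3, 2.0, rng=np.random.default_rng(1)))
```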

Sequential Decision Problem: Let $L_1, L_2, \ldots, L_N$ denote the relay locations, which are i.i.d. uniform over the
forwarding set $\mathcal{L}$. Let $A(\cdot)$ denote the uniform distribution over $\mathcal{L}$ so that, for $k = 1, 2, \ldots, N$, the distribution of $L_k$
is $A$. The source (with a packet to forward at time 0) only knows that there are $N$ relays in its forwarding set $\mathcal{L}$, but
a priori does not know their locations ($L_k$) nor their channel gains ($H_{L_k}$). When the $k$-th relay wakes up, we assume
that its location $L_k$, and hence the reward distribution $F_{L_k}$, is revealed to the source. This can be accomplished by
including the location information $L_k$ within a control packet (sent using a low rate robust modulation technique, and
hence assumed to be error free) transmitted by the $k$-th relay upon wake-up. However, if the source wishes to learn
the channel gain $H_{L_k}$ (and hence the exact reward value $R_{L_k}$), it has to send additional probe packets, incurring
an energy cost of $\delta$ units. Thus, when the $k$-th relay wakes up (referred to as stage $k$) the actions available to the
source are:

• stop and forward the packet to a previously probed relay, accruing the reward of that relay. It is clear that it
is optimal to forward, among all the probed relays, to the relay with the maximum reward. Thus, henceforth,
the action stop always implies that the best relay (among those that have been probed) is chosen. With the
stop action the decision process terminates.

• continue to wait for the next relay to wake up (the average waiting time is $\tau$) and reveal its reward distribution,
at which instant the decision process is regarded to have entered stage $k + 1$.

• probe a relay from the set of all unprobed relays (provided there is at least one unprobed relay). The probed
relay's reward value is revealed, allowing the source to update the maximum reward among all the probed relays.
After probing, the decision process is still at stage $k$ and again the source has to decide upon an action.

In the model, for the sake of analysis, we neglect the time taken to probe a relay and learn its channel gain.
We also neglect the time taken for the exchange of control packets. This is reasonable for very low duty cycling
networks where the average inter-wake-up times are much larger than the time taken for probing, or the exchange
of control packets.

A decision rule or a policy is a mapping, at each stage, from all histories of the decision process to the set of
actions. Let $\Pi$ represent the class of all policies. For a policy $\pi \in \Pi$ the delay incurred, $D_\pi$, is the time until a relay
is chosen (which is one of the $W_k$). Let $R_\pi$ represent the reward of the chosen relay and $M_\pi$ be the total number
of relays that were probed during the decision process. Recalling that $\delta$ is the cost of probing, $\delta M_\pi$ represents the
total cost of probing using policy $\pi$. We would like to think of $(R_\pi - \delta M_\pi)$ as the effective reward obtained using
policy $\pi$. The problem we are interested in is the following:

$$\min_{\pi \in \Pi} \ \mathbb{E} D_\pi \quad \text{subject to:} \quad \mathbb{E} R_\pi - \delta\, \mathbb{E} M_\pi \ge \gamma, \qquad (2)$$

where $\gamma > 0$ is a constraint on the effective reward. We introduce a Lagrange multiplier $\eta > 0$ and focus our
attention towards solving the following unconstrained problem:

$$\min_{\pi \in \Pi} \left( \mathbb{E} D_\pi - \eta\, \mathbb{E} R_\pi + \eta \delta\, \mathbb{E} M_\pi \right). \qquad (3)$$
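
The link between the unconstrained and the constrained problem can be made explicit with a short worked argument. This is the standard Lagrangian sufficiency reasoning and is not taken from the source; the policy name $\pi_\eta$ and the assumption that it meets the constraint with equality are introduced here for illustration. For any policy $\pi$ that satisfies the effective reward constraint:

```latex
% Assumption: \pi_\eta attains the minimum of the unconstrained (Lagrangian) problem
% and satisfies E R_{\pi_\eta} - \delta E M_{\pi_\eta} = \gamma.
\begin{align*}
\mathbb{E} D_\pi
  &\ge \mathbb{E} D_\pi - \eta \left( \mathbb{E} R_\pi - \delta\, \mathbb{E} M_\pi - \gamma \right)
     && \text{(feasibility of $\pi$ and $\eta > 0$)} \\
  &\ge \mathbb{E} D_{\pi_\eta} - \eta \left( \mathbb{E} R_{\pi_\eta} - \delta\, \mathbb{E} M_{\pi_\eta} - \gamma \right)
     && \text{(optimality of $\pi_\eta$ for the unconstrained problem)} \\
  &= \mathbb{E} D_{\pi_\eta}
     && \text{(constraint met with equality).}
\end{align*}
```

Under these assumptions, $\pi_\eta$ also minimizes the average delay among all policies meeting the constraint, which is why attention can be restricted to the unconstrained problem for a suitable $\eta$.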

We will call $\Pi$ the complete class of policies, as $\Pi$ contains policies that are allowed, for each stage $k$, to
keep all the relays until stage $k$ awake. Formally, if $b_k = \max\{R_{L_i} : i \le k, \text{ relay } i \text{ has been probed}\}$ and
$\mathcal{F}_k = \{F_{L_i} : i \le k, \text{ relay } i \text{ is unprobed}\}$, then the decision at stage $k$ is based on $(b_k, \mathcal{F}_k)$. Thus, the set of all
possible states at stage $k$ is large. Hence, for analytical tractability, we first consider (in Section III) solving the
problem in (3) over a restricted class of policies $\overline{\Pi} \subset \Pi$, where a policy in $\overline{\Pi}$ is restricted to take decisions keeping
only two relays awake, one the best among all probed relays and one among the unprobed ones, i.e., the decision
at stage $k$ is based on $(b_k, G_k)$ where $G_k \in \mathcal{F}_k$.

A. Related Work 

The work reported in this paper is an extension of our earlier work ([4], [8]). The major difference is that in
[4] we assumed that when a relay wakes up its exact reward value is revealed. This is reasonable if the reward is
simply the geographical progress (towards the sink) made by a relay, which was the case in [4]. In [8] we studied
a variant where the number of relays $N$ is not known to the source. However, in [8] as well the exact reward value
is revealed by a relay upon wake-up.

There has been other work in the context of geographical forwarding and anycast routing, where the problem
of choosing one among several neighboring nodes has been studied [10], [12]. The authors in [10] study the
policy of always choosing a neighbor that makes the maximum progress towards the sink. Thus, they do not
study the tradeoff between the relay selection delay and the progress (or other reward metric), which is the major
contribution of our work. For a sleep-wake cycling network, Kim et al. in [1] have developed a distributed Bellman-
Ford algorithm (referred to as LOCAL-OPT) to minimize the average end-to-end delay. However, a pre-configuration
phase, involving many control message exchanges, is required to run the LOCAL-OPT algorithm.

In all of the above work, the metric that signifies the quality of a relay is always exactly revealed, and hence
does not involve a probing action. The action to probe generally occurs in the problem of channel selection [13],
[14]. We will discuss the parallels and differences of these works (particularly that of Chaporkar and Proutiere in
[13]) with ours in detail in Section VI.

Finally (but very importantly) our work can be considered as a variant of the asset selling problem, which is a
major class within the optimal stopping problems [15] (the other classes being the secretary problem, the bandit
problem, etc.). The basic asset selling problem [16], [17] comprises offers that arrive sequentially over time. The
offers are i.i.d. As the offers arrive, the seller has to decide whether to take an offer or wait for future offers. The
seller has to pay a cost to observe the next offer. Previous offers cannot be recalled. The decision process ends with
the seller choosing an offer. Over the years, several variants of the basic problem have been studied, e.g., problems
with uncertain recall [18], problems with unknown reward distribution [19], etc. In most of the variants, when an
offer arrives, its value is exactly revealed to the seller, while in our case only the offer (i.e., reward) distribution is
made available and the source, if it wishes, has to probe to know the exact offer value.

Close to our work is that of Stadje [20], where only some initial information about an offer (e.g., the average size
of the offer) is revealed to the decision maker upon its arrival. In addition to the actions stop and continue, the
decision maker can also choose to obtain more information about the offer by incurring a cost. Recalling previous
offers is not allowed. A similar problem is studied by Thejaswi et al. in [21], where initially a coarse estimate of
the channel gain is made available to the transmitter. The transmitter can choose to probe the channel a second
time to get a better estimate. For both these works ([20], [21]), the optimal policies are characterized by thresholds.
However, the horizon length of these problems is infinite, because of which the thresholds are stage independent.
In general, for a finite horizon problem the optimal policy would be stage dependent. However, for our problem
(despite being a finite horizon one) we are able to show that certain optimal stopping sets are identical at every
stage. This is due to the fact that we are allowing the best probed relay to stay awake.

III. Restricted Class $\overline{\Pi}$: An MDP Formulation

In this section we will formulate our local forwarding problem as a Markov decision process [22], which will
require us to first discuss the one-step costs and the state transitions.

A. One-Step Costs and State Transition 

Recall that for any policy in the restricted class $\overline{\Pi}$, the decision at stage $k$ should be based on $(b_k, G_k)$, where $b_k$
is the best reward so far and $G_k \in \mathcal{F}_k$ is the reward distribution of an unprobed relay that is retained until stage $k$.
If the source's decision is to stop, then the source enters a terminating state $\psi$, incurring a cost of $-\eta b_k$ (recall
from (3) that $\eta$ is the Lagrange multiplier).

If the action is to continue, then the source will first incur a waiting cost of $U_{k+1}$ (the average cost is $\tau$). When
the $(k+1)$-th relay wakes up (whose reward distribution is $F_{L_{k+1}}$), the source first has to choose between $G_k$ and
$F_{L_{k+1}}$, and then put the other one to sleep, so that the state at stage $k + 1$ will again be of the form $(b_{k+1}, G_{k+1})$. Since
the action at stage $k$ was to continue, the set of probed relays at stage $k + 1$ remains unchanged, so that $b_{k+1} = b_k$.

Finally, the source could, at stage $k$, probe the distribution $G_k$, incurring a cost of $\eta\delta$. After probing, the decision
process is still considered to be at stage $k$, where now the state is $b'_k = \max\{b_k, R_k\}$ ($R_k$ is a sample from
the distribution $G_k$). The source now has to further decide whether to stop (incurring a cost of $-\eta b'_k$ and entering $\psi$)
or continue (the cost is $U_{k+1}$ and the next state is $(b'_k, F_{L_{k+1}})$). Note that for a policy $\pi$, the sum of all the one-step
costs, starting from stage 1, will equal the total cost$^1$ in (3).

B. Bellman Equation and the Cost-to-go Functions 

Let $J_k(\cdot)$, $k = 1, 2, \ldots, N$, represent the optimal cost-to-go function at stage $k$. Thus, $J_k(b)$ and $J_k(b, F_\ell)$
denote the cost-to-go when there is no unprobed relay and when there is one, respectively. For the last stage, $N$, we have
$J_N(b) = -\eta b$, and

$$J_N(b, F_\ell) = \min\left\{ -\eta b,\ \eta\delta + \mathbb{E}_\ell\left[ J_N(\max\{b, R_\ell\}) \right] \right\} = \min\left\{ -\eta b,\ \eta\delta - \eta\, \mathbb{E}_\ell\left[ \max\{b, R_\ell\} \right] \right\}, \qquad (4)$$

where $\mathbb{E}_\ell$ denotes expectation w.r.t. the distribution $F_\ell$. The first term in the min-expression above is the cost of
stopping and the second term is the average cost of probing and then stopping (the action continue is not available
at the last stage $N$). Next, for stage $k = N - 1, N - 2, \ldots, 1$, denoting by $\mathbb{E}_A$ the expectation over the distribution
$A$ of the index, $L_{k+1}$, of the next relay to wake up, we have

$$J_k(b) = \min\left\{ -\eta b,\ \tau + \mathbb{E}_A\left[ J_{k+1}(b, F_{L_{k+1}}) \right] \right\} \qquad (5)$$

and

$$J_k(b, F_\ell) = \min\left\{ -\eta b,\ \eta\delta + \mathbb{E}_\ell\left[ J_k(\max\{b, R_\ell\}) \right],\ \tau + \mathbb{E}_A\left[ \min\left\{ J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}}) \right\} \right] \right\}. \qquad (6)$$

The last term in the min-expression of (5) and (6) is the average cost of continuing. When the state at stage $k$,
$1 \le k \le N - 1$, is $(b, F_\ell)$ and the source decides to continue, then the reward distribution of the next relay
is $F_{L_{k+1}}$. Now, given the distributions $F_\ell$ and $F_{L_{k+1}}$, if the source is asked to retain one of them, it is always
optimal to go with the distribution that fetches a lower cost-to-go from stage $k + 1$ onwards, i.e., it is optimal to
retain $F_\ell$ if $J_{k+1}(b, F_\ell) \le J_{k+1}(b, F_{L_{k+1}})$, otherwise retain $F_{L_{k+1}}$.$^2$ In the next section we will show that if $F_\ell$ is
stochastically greater than $F_u$ then $J_{k+1}(b, F_\ell) \le J_{k+1}(b, F_u)$ (see Lemma 2(ii)).

$^1$Since every policy has to invariably wait for the first relay to wake up, at which instant the decision process begins, $U_1$ is not accounted
for in the total cost by any policy $\pi$.

First, for simplicity let us introduce the following notation. For $k = 1, 2, \ldots, N - 1$, let $cc_k(\cdot)$ represent the
cost of continuing, i.e.,

$$cc_k(b) = \tau + \mathbb{E}_A\left[ J_{k+1}(b, F_{L_{k+1}}) \right] \qquad (7)$$

$$cc_k(b, F_\ell) = \tau + \mathbb{E}_A\left[ \min\left\{ J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}}) \right\} \right], \qquad (8)$$

and, for $k = 1, 2, \ldots, N$, the cost of probing, $cp_k(\cdot)$, is

$$cp_k(b, F_\ell) = \eta\delta + \mathbb{E}_\ell\left[ J_k(\max\{b, R_\ell\}) \right]. \qquad (9)$$

The above expressions will be used extensively from now on. From (7) and (8) it is immediately clear that
$cc_k(b, F_\ell) \le cc_k(b)$ for any $F_\ell$ ($\ell \in \mathcal{L}$). We note this down as a lemma.

Lemma 1: For any $b$ and $F_\ell$ we have $cc_k(b, F_\ell) \le cc_k(b)$. ■

Using the above notation, the cost-to-go functions can be written as

$$J_k(b) = \min\left\{ -\eta b,\ cc_k(b) \right\} \qquad (10)$$

$$J_k(b, F_\ell) = \min\left\{ -\eta b,\ cp_k(b, F_\ell),\ cc_k(b, F_\ell) \right\}. \qquad (11)$$
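
The backward recursion for the cost-to-go functions can be sketched numerically as follows. This is a minimal illustration under strong simplifying assumptions that are not part of the source: rewards take values on a finite sorted grid, there is a finite family of reward distributions given as probability vectors on that grid, and `loc_prob` gives the probability that the next relay to wake up has each distribution. The function name `solve_restricted_dp` and all argument names are hypothetical, and the smallest grid value simply stands in for the case where no relay has been probed yet.

```python
import numpy as np

def solve_restricted_dp(grid, pmfs, loc_prob, N, eta, delta, tau):
    """Backward induction for the restricted-class cost-to-go functions.

    Returns two dicts keyed by stage k = 1..N:
      J1[k][i]    : cost-to-go at state b = grid[i] (no unprobed relay)
      J2[k][i, j] : cost-to-go at state (grid[i], j-th distribution)
    """
    grid = np.asarray(grid, dtype=float)
    P = np.asarray(pmfs, dtype=float)        # shape (m, n): m distributions
    a = np.asarray(loc_prob, dtype=float)    # shape (m,)
    n = grid.size
    # max_idx[i, r] is the grid index of max{grid[i], grid[r]}
    max_idx = np.maximum.outer(np.arange(n), np.arange(n))
    stop = -eta * grid                       # cost of stopping at each b

    J1 = {N: stop.copy()}
    # Last stage: stop, or probe and then stop.
    J2 = {N: np.minimum(stop[:, None], eta * delta + stop[max_idx] @ P.T)}

    for k in range(N - 1, 0, -1):
        nxt = J2[k + 1]                                  # shape (n, m)
        cc1 = tau + nxt @ a                              # continue from b
        J1[k] = np.minimum(stop, cc1)
        # Continue from (b, F_j): keep the better of F_j and the newcomer.
        cc2 = tau + np.minimum(nxt[:, :, None], nxt[:, None, :]) @ a
        cp = eta * delta + J1[k][max_idx] @ P.T          # probe F_j
        J2[k] = np.minimum(stop[:, None], np.minimum(cp, cc2))
    return J1, J2

grid = [0.0, 1.0, 2.0, 3.0]
pmfs = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
J1, J2 = solve_restricted_dp(grid, pmfs, [0.5, 0.5], N=4, eta=1.0, delta=0.2, tau=0.5)
print(J2[1])
```

Comparing `-eta * grid` with the continuation cost at each stage then gives a numerical picture of where stopping is preferred, which can be used to explore how the stopping region behaves across stages for a chosen set of parameters.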

C. Ordering Results for the Cost-to-go Functions 

We begin with the definition of stochastic ordering. 

Definition 1 (Stochastic Ordering): Given two distributions $F_\ell$ and $F_u$ we say that $F_\ell$ is stochastically greater
than $F_u$, denoted as $F_\ell \ge_{st} F_u$, if $F_\ell(x) \le F_u(x)$ for all $x$. Alternatively, one could use the following definition:
$F_\ell \ge_{st} F_u$ if and only if for every non-decreasing (non-increasing) function $f : \mathbb{R} \to \mathbb{R}$, $\mathbb{E}_\ell[f(R_\ell)] \ge \mathbb{E}_u[f(R_u)]$
($\mathbb{E}_\ell[f(R_\ell)] \le \mathbb{E}_u[f(R_u)]$), where the distributions of $R_\ell$ and $R_u$ are $F_\ell$ and $F_u$, respectively. ■
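
For distributions on a common finite support, the first form of the definition can be checked directly by comparing cumulative sums. The sketch below assumes both distributions are given as probability vectors over the same sorted support; the function name and the tolerance argument are illustrative.

```python
import numpy as np

def stochastically_greater(pmf_l, pmf_u, tol=1e-12):
    """True if the first distribution is stochastically greater than the second.

    Both arguments are probability vectors over the same sorted support;
    the test is CDF_l(x) <= CDF_u(x) at every support point.
    """
    cdf_l = np.cumsum(pmf_l)
    cdf_u = np.cumsum(pmf_u)
    return bool(np.all(cdf_l <= cdf_u + tol))

high = [0.1, 0.2, 0.3, 0.4]   # more mass on large rewards
low = [0.4, 0.3, 0.2, 0.1]    # more mass on small rewards
print(stochastically_greater(high, low), stochastically_greater(low, high))
```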

The ordering properties of the cost-to-go functions are listed in the following lemma. 

Lemma 2: For $1 \le k \le N$ (for part (iii), $1 \le k \le N - 1$),

(i) $J_k(b)$ and $J_k(b, F_\ell)$ are decreasing in $b$.

(ii) if $F_\ell \ge_{st} F_u$ then $J_k(b, F_\ell) \le J_k(b, F_u)$.

(iii) $J_k(b) \le J_{k+1}(b)$ and $J_k(b, F_\ell) \le J_{k+1}(b, F_\ell)$.

Proof: Parts (i) and (iii) follow easily by straightforward induction. To prove part (ii), we need to use part (i)
and the definition of stochastic ordering (Definition 1). A formal proof is available in Appendix VIII-B. ■

The following lemma is an immediate consequence of Lemma 2(ii) and 2(iii), respectively.

Lemma 3: (i) For $1 \le k \le N - 1$, if $F_\ell \ge_{st} F_u$ then $cp_k(b, F_\ell) \le cp_k(b, F_u)$ and $cc_k(b, F_\ell) \le cc_k(b, F_u)$.

(ii) For $1 \le k \le N - 2$, $cc_k(b) \le cc_{k+1}(b)$, $cp_k(b, F_\ell) \le cp_{k+1}(b, F_\ell)$ and $cc_k(b, F_\ell) \le cc_{k+1}(b, F_\ell)$. ■

$^2$Formally one has to introduce an intermediate state of the form $(b, F_\ell, F_{L_{k+1}})$ at stage $k + 1$ where the only actions available are to choose
$F_\ell$ or $F_{L_{k+1}}$. Then $J_{k+1}(b, F_\ell, F_{L_{k+1}}) = \min\{J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}})\}$, which, for simplicity, we are directly using in (6).

IV. Restricted Class $\overline{\Pi}$: Structural Results

We begin by defining, at stage $k$, the optimal stopping set $\mathcal{S}_k$. Similarly, for a given distribution $F_\ell$ we define
the optimal stopping set $\mathcal{S}_k^\ell$ and the optimal stopping/probing set $\mathcal{Q}_k^\ell$. For $k = 1, 2, \ldots, N - 1$,

$$\mathcal{S}_k := \left\{ b : -\eta b \le cc_k(b) \right\}, \qquad (12)$$

and, for a given distribution $F_\ell$,

$$\mathcal{S}_k^\ell := \left\{ b : -\eta b \le \min\{ cp_k(b, F_\ell), cc_k(b, F_\ell) \} \right\} \qquad (13)$$

$$\mathcal{Q}_k^\ell := \left\{ b : \min\{ -\eta b, cp_k(b, F_\ell) \} \le cc_k(b, F_\ell) \right\}. \qquad (14)$$

From (10) it follows that the optimal stopping set $\mathcal{S}_k$ is the set of states $b$ (states of this form are obtained after
probing at stage $k$) where it is better to stop than to continue. Similarly, from (11), the set $\mathcal{S}_k^\ell$ has to be interpreted
as, for a given distribution $F_\ell$, the set of $b$ such that at the state $(b, F_\ell)$ it is better to stop, while the set $\mathcal{Q}_k^\ell$ is the
set of $b$ such that at $(b, F_\ell)$ it is better to either stop or probe.

From the definition of these sets, the following ordering properties are straightforward. 

Lemma 4: For $k = 1, 2, \ldots, N - 1$ and any $F_\ell$ we have (i) $\mathcal{S}_k^\ell \subseteq \mathcal{Q}_k^\ell$, (ii) $\mathcal{S}_k^\ell \subseteq \mathcal{S}_k$, (iii) if $F_\ell \ge_{st} F_u$ then
$\mathcal{S}_k^\ell \subseteq \mathcal{S}_k^u$, and (iv) $\mathcal{S}_k \subseteq \mathcal{S}_{k+1}$ and $\mathcal{S}_k^\ell \subseteq \mathcal{S}_{k+1}^\ell$.

Proof: Recall the definitions in (12), (13) and (14). Then (i) simply follows from the definitions. For (ii), recall
from Lemma 1 that $cc_k(b, F_\ell) \le cc_k(b)$. Finally, (iii) and (iv) are simple consequences of Lemma 3(i) and 3(ii),
respectively. ■

However, it is not immediately clear how the sets $\mathcal{Q}_k^\ell$ and $\mathcal{S}_k$ are ordered. Later in this section we will, under
the assumption that $\mathcal{F}$ is totally stochastically ordered (to be defined next), show that $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$. This result will
enable us to prove that the stopping sets, $\mathcal{S}_k$, are identical at every stage $k$ (Theorem 2). Similarly, in Theorem 3
we will show that, for any distribution $F_\ell$, $\mathcal{S}_k^\ell$ are also identical across all stages $k$. First we need the following
assumption.

Assumption 1 (Total Stochastic Ordering Property of $\mathcal{F}$): From here on, we will assume that $\mathcal{F}$ is totally stochastically
ordered, meaning that any two distributions from $\mathcal{F}$ are always stochastically ordered. Formally, if $F_\ell, F_u \in \mathcal{F}$
then either $F_\ell \ge_{st} F_u$ or $F_u \ge_{st} F_\ell$. We further assume the existence of a minimal distribution $F_m \in \mathcal{F}$ such that
for every $F_\ell \in \mathcal{F}$ we have $F_\ell \ge_{st} F_m$. ■

Note that such a restriction on $\mathcal{F}$ is not very stringent, in the sense that the set of distributions arising in our
local forwarding problem in Section II, $\{F_\ell : \ell \in \mathcal{L}\}$, is totally stochastically ordered.

A. Stopping Sets: Threshold Rule 

Before showing the stage independence property of the stopping sets, first in this section we will prove another 
important property about the stopping sets, that they are characterized by thresholds. The following is the key 
lemma for proving such a property. 
Lemma 5: For $1 \le k \le N - 1$ and for $b_2 > b_1$ we have
(i) $cc_k(b_1) - cc_k(b_2) \le \eta(b_2 - b_1)$,
and for any distribution $F_\ell$ we have,
(ii) $cp_k(b_1, F_\ell) - cp_k(b_2, F_\ell) \le \eta(b_2 - b_1)$,
(iii) $cc_k(b_1, F_\ell) - cc_k(b_2, F_\ell) \le \eta(b_2 - b_1)$.

Proof: See Appendix VIII-C. ■

Theorem 1: For $1 \le k \le N - 1$ and for $b_2 > b_1$,
(i) if $b_1 \in \mathcal{S}_k$ then $b_2 \in \mathcal{S}_k$,
(ii) for any $F_\ell$, if $b_1 \in \mathcal{S}_k^\ell$ then $b_2 \in \mathcal{S}_k^\ell$.

Proof: Part (i): Using part (i) of the previous lemma (Lemma 5(i)) we can write

$$-\eta b_2 \le -\eta b_1 - cc_k(b_1) + cc_k(b_2).$$

Since $b_1 \in \mathcal{S}_k$, from (12) we know that $-\eta b_1 \le cc_k(b_1)$; using this in the above expression, we obtain $-\eta b_2 \le
cc_k(b_2)$, so that $b_2 \in \mathcal{S}_k$. Part (ii) similarly follows using Lemma 5(ii) and 5(iii), respectively, to show $-\eta b_2 \le
cp_k(b_2, F_\ell)$ and $-\eta b_2 \le cc_k(b_2, F_\ell)$. ■

Discussion: Thus, the stopping set $\mathcal{S}_k$ can be characterized in terms of a lower bound $x_k$ as illustrated in Fig. 2(a).
Similarly, for distributions $F_\ell$ and $F_u$, there are (possibly different) thresholds $x_k^\ell$ and $x_k^u$ (Fig. 2(b) and 2(c)). Using
Lemma 4(ii) and 4(iii), for $F_\ell \ge_{st} F_u$ we can write $x_k \le x_k^u \le x_k^\ell$. These thresholds are for a given stage $k$.
From Lemma 4(iv), we know that the thresholds $x_k$ and $x_k^\ell$ are decreasing with $k$. The main result in the next
section (Theorems 2 and 3) is to show that these thresholds are, in fact, equal (i.e., $x_k = x_{k+1}$ and $x_k^\ell = x_{k+1}^\ell$).

Fig. 2. Illustration of the threshold property. S, P and C in the figure represent the actions stop, probe and continue, respectively. (a)
$\mathcal{S}_k$ is characterized by the threshold $x_k$. (b) and (c) depict the stopping sets corresponding to the distributions $F_\ell$ and $F_u$, respectively. The
distributions are such that $F_\ell \ge_{st} F_u$.



B. Optimal Probing Sets 

Now, similar to the stopping set $\mathcal{S}_k^\ell$, one can define an optimal probing set $\mathcal{P}_k^\ell$ as the set of all $b$ such that at
$(b, F_\ell)$ it is better to probe than to stop or continue, i.e.,

$$\mathcal{P}_k^\ell := \left\{ b : cp_k(b, F_\ell) \le \min\{ -\eta b, cc_k(b, F_\ell) \} \right\}. \qquad (15)$$

Alternatively, $\mathcal{P}_k^\ell$ is simply the difference of the sets $\mathcal{Q}_k^\ell$ and $\mathcal{S}_k^\ell$, i.e., $\mathcal{P}_k^\ell = \mathcal{Q}_k^\ell \setminus \mathcal{S}_k^\ell$. In our numerical work we
observed that, similar to the stopping set, the probing set was characterized by an upper bound $y_k^\ell$ as illustrated in
Fig. 2(b) (where $y_k^\ell = x_k^\ell$) and 2(c). At the time of this writing we have not proved such a result. However, we
strongly believe that it is true and make the following conjecture.

Conjecture 1: For $1 \le k \le N - 1$, for any $F_\ell$, if $b_2 \in \mathcal{P}_k^\ell$ then for any $b_1 < b_2$ we have $b_1 \in \mathcal{P}_k^\ell$. ■

The above conjecture, along with Lemma 7 (in the next section, where we show that, for any $F_\ell$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$),
is the reason for the illustration in Fig. 2(b) to not contain a region where it is optimal to continue. Whenever
$x_k^\ell > x_k$, from Lemma 7 we know that it is optimal to probe for every $b \in [x_k, x_k^\ell)$. Then the above conjecture
suggests that it is optimal to probe for every $b < x_k^\ell$, leaving no region where it is optimal to continue. Unlike $x_k^\ell$,
the thresholds $y_k^\ell$ are stage dependent. In fact, from our numerical work, we observe that $y_k^\ell$ are increasing with $k$.

C. Stopping Sets: Stage Independence Property 

In this section we will prove that the stopping sets are identical across the stages. First we need the following 
results. 

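Lemma 6: Suppose $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$ for some $k = 1, 2, \ldots, N - 1$ and some $F_\ell$. Then for every $b \in \mathcal{S}_k$ we have
$J_k(b, F_\ell) = J_N(b, F_\ell)$.

Proof: Fix a $b \in \mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$. From the definition of the set $\mathcal{Q}_k^\ell$ (from (14)) we know that at $(b, F_\ell)$ it is optimal
to either stop or probe. Further, after probing the state is $\max\{b, R_\ell\} \in \mathcal{S}_k$ (from Theorem 1(i), since $\max\{b, R_\ell\} \ge b$), so that it is optimal
to stop. Combining these observations we can write

$$J_k(b, F_\ell) = \min\left\{ -\eta b,\ \eta\delta + \mathbb{E}_\ell\left[ J_k(\max\{b, R_\ell\}) \right] \right\} = \min\left\{ -\eta b,\ \eta\delta - \eta\, \mathbb{E}_\ell\left[ \max\{b, R_\ell\} \right] \right\} = J_N(b, F_\ell).$$

■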

Lemma 7: For $1 \le k \le N - 1$ and for any $F_\ell \in \mathcal{F}$ we have $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$.

Proof Outline: For a formal proof see Appendix VIII-D. We only provide an outline here.

Step 1: First we show (Lemma 10 in Appendix VIII-D) that if there exists an $F_u$ such that $F_\ell \ge_{st} F_u$ and, for
every $1 \le k \le N - 1$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^u$, then we have $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$. This requires the total stochastic ordering of $\mathcal{F}$.

Step 2: Next we show (Lemma 11 in Appendix VIII-D) that the minimal distribution $F_m$ satisfies, for every
$1 \le k \le N - 1$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^m$.

The proof is complete by recalling that $F_\ell \ge_{st} F_m$ for every $F_\ell \in \mathcal{F}$ (Assumption 1) and then using, in Step 1,
$F_m$ in the place of $F_u$. ■

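The following are the main theorems of this section.

Theorem 2: For $1 \le k \le N-2$, $\mathcal{S}_k = \mathcal{S}_{k+1}$.

Proof: From Lemma 4-(iv) we already know that $\mathcal{S}_k \subseteq \mathcal{S}_{k+1}$. Here, we will show that $\mathcal{S}_k \supseteq \mathcal{S}_{k+1}$. Fix a
$b \in \mathcal{S}_{k+1} \subseteq \mathcal{S}_{k+2}$. From Lemma 7 we know that $\mathcal{S}_{k+1} \subseteq \mathcal{Q}_{k+1}^\ell$ and $\mathcal{S}_{k+2} \subseteq \mathcal{Q}_{k+2}^\ell$, for every $F_\ell$. Then, applying
Lemma 6 we can write $J_{k+2}(b, F_\ell) = J_{k+1}(b, F_\ell) = J_N(b, F_\ell)$. Thus,

$$c^c_{k+1}(b) = \tau + \mathbb{E}_A\big[ J_{k+2}(b, F_{L_{k+2}}) \big] = \tau + \mathbb{E}_A\big[ J_{k+1}(b, F_{L_{k+1}}) \big] = c^c_k(b).$$

Now, since $b \in \mathcal{S}_{k+1}$ we have $-\eta b \le c^c_{k+1}(b) = c^c_k(b)$, which implies that $b \in \mathcal{S}_k$. ■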

Remark: It is worth pointing out the parallels and the differences between the above result and that in [22, Section 4.4,
pp. 165], where it is shown that the one-step-look-ahead rule is optimal, but for the case where the rewards are
exactly revealed. There, as in our Lemma 6, the key idea is to show that the cost-to-go functions, at every stage $k$,
are identical for every state within the stopping set. For our case, to apply Lemma 6 it was further essential for us
to prove Lemma 7, showing that for every $F_\ell$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$. Now, for $\delta = 0$, Lemma 7 trivially holds, since at $(b, F_\ell)$
it is always optimal to probe (so that $\mathcal{Q}_k^\ell = \Re_+$). Further, $\delta = 0$ can be thought of as the case where the rewards
are exactly revealed. Thus, Theorem 2 can be considered as a generalization of the result in [22, Section 4.4] for
the case $\delta > 0$. ■

For any $F_\ell$, the stopping sets $\mathcal{S}_k^\ell$ are also stage independent.

Theorem 3: For $1 \le k \le N-2$ and for any $F_\ell$, $\mathcal{S}_k^\ell = \mathcal{S}_{k+1}^\ell$.
Proof: Similar to the proof of Theorem 2, here we need to show that, for $b \in \mathcal{S}_{k+1}^\ell$, the probing and continuing
costs satisfy $c^p_k(b, F_\ell) = c^p_{k+1}(b, F_\ell)$ and $c^c_k(b, F_\ell) = c^c_{k+1}(b, F_\ell)$, respectively. See Appendix VIII-E. ■

From the previous subsection we already know that the optimal policy is completely characterized, at every stage
$k$, in terms of the thresholds $x_k$ and $\{x_k^\ell, y_k^\ell\}$ for every $F_\ell$. From Theorems 2 and 3 in this section, we further know
that the thresholds $x_k$ and $x_k^\ell$ are stage independent and hence have to be computed only once for stage $N-1$,
thus simplifying the overall computation of the optimal policy.
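
The recursions for the restricted class can be evaluated numerically by backward induction. The following is a minimal Python sketch, not the authors' implementation. It assumes rewards on a finite increasing grid, a finite family of reward pmfs `pmfs` indexed by $\ell$, and a pmf `A` over that index; the function names and the example parameters are hypothetical. The cost expressions follow the forms of $c^c_k(b)$, $c^p_k(b, F_\ell)$ and $c^c_k(b, F_\ell)$ used in the appendix proofs.

```python
def restricted_value_iteration(grid, pmfs, A, N, eta, delta, tau):
    """Backward induction over stages N, N-1, ..., 1 (restricted class).

    grid : increasing list of reward values (assumed discretization)
    pmfs : pmfs[l][j] = P(R_l = grid[j]) for distribution F_l
    A    : A[l] = probability that a waking relay has distribution F_l
    Returns J[k][i] = J_k(grid[i]) and JF[k][l][i] = J_k(grid[i], F_l).
    """
    n, m = len(grid), len(pmfs)
    stop = [-eta * b for b in grid]  # cost of stopping at each b

    def expect_after_probe(values, pmf, i):
        # E[values(max{b_i, R})]; on an increasing grid max{b_i, r_j} = grid[max(i, j)]
        return sum(p * values[max(i, j)] for j, p in enumerate(pmf))

    J = {N: stop[:]}
    # Last stage: stop, or probe and then stop (no continue action).
    JF = {N: [[min(stop[i], eta * delta + expect_after_probe(stop, pmfs[l], i))
               for i in range(n)] for l in range(m)]}

    for k in range(N - 1, 0, -1):
        nxt = JF[k + 1]
        # J_k(b) = min{-eta*b, tau + E_A[J_{k+1}(b, F_L)]}
        J[k] = [min(stop[i], tau + sum(A[t] * nxt[t][i] for t in range(m)))
                for i in range(n)]
        JF[k] = []
        for l in range(m):
            row = []
            for i in range(n):
                cp = eta * delta + expect_after_probe(J[k], pmfs[l], i)
                cc = tau + sum(A[t] * min(nxt[l][i], nxt[t][i]) for t in range(m))
                row.append(min(stop[i], cp, cc))
            JF[k].append(row)
    return J, JF


def stopping_threshold(grid, J_k, eta, tol=1e-12):
    """Smallest grid point at which stopping attains J_k(b) = -eta*b."""
    return next(b for b, v in zip(grid, J_k) if abs(v + eta * b) <= tol)


if __name__ == "__main__":
    grid = [i / 10 for i in range(11)]
    w = list(range(1, 12))
    # Three stochastically ordered pmfs (illustrative choice only).
    pmfs = [[x / sum(w) for x in reversed(w)], [1 / 11] * 11, [x / sum(w) for x in w]]
    J, JF = restricted_value_iteration(grid, pmfs, [1 / 3] * 3, N=5,
                                       eta=10.0, delta=0.05, tau=0.2)
    print([stopping_threshold(grid, J[k], 10.0) for k in range(1, 5)])
```

Printing the thresholds for $k = 1, \dots, N-1$ gives a quick way to compare them across stages against the stage independence stated in Theorem 2.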

V. Optimal Policy within the Complete Class $\Pi$

In this section we will consider the complete class $\Pi$. Here we will only informally discuss the possible structure
of the optimal policy within $\Pi$. Formal analysis is still in progress at the time of this writing. However, we will
write the expressions for the recursive cost-to-go functions (analogous to the ones derived earlier for the restricted class).

Recall that a policy within $\Pi$, at stage $k$, is in general allowed to take decisions based on the entire history.
Formally, at stage $k$, let $b_k = \max\{R_i : i \le k, \text{relay } i \text{ has been probed}\}$ and $\mathcal{F}_k = \{F_{L_i} : i \le k, \text{relay } i \text{ is unprobed}\}$,
then the decision at stage $k$ is based on $(b_k, \mathcal{F}_k)$. Thus the set of all possible states
(State Space $SS_k$) at stage $k$ is

$$SS_k = \Big\{ (b, \mathcal{G}) : b \in \Re_+,\; \mathcal{G} = \{G_1, G_2, \dots, G_\ell\},\; 0 \le \ell \le k,\; G_i \in \mathcal{F},\; 1 \le i \le \ell \Big\}. \qquad (16)$$



Again the actions available are stop, probe and continue. Further, if the action is to probe then one has to decide 
which relay to probe, among the several ones awake at stage k. 



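Let $J_k$, $k = N, N-1, \dots, 1$, represent the optimal cost-to-go at stage $k$ (for simplicity we are again using $J_k$),
then, $J_N(b) = -\eta b$, and

$$J_N(b, \mathcal{G}) = \min\Big\{ -\eta b,\; \eta\delta + \min_{G_i \in \mathcal{G}} \mathbb{E}_{G_i}\big[ J_N(\max\{b, R_i\}, \mathcal{G} \setminus \{G_i\}) \big] \Big\}. \qquad (17)$$

For stage $k = N-1, N-2, \dots, 1$ we have

$$J_k(b) = \min\Big\{ -\eta b,\; \tau + \mathbb{E}_A\big[ J_{k+1}(b, \{F_{L_{k+1}}\}) \big] \Big\}, \qquad (18)$$

$$J_k(b, \mathcal{G}) = \min\Big\{ -\eta b,\; \eta\delta + \min_{G_i \in \mathcal{G}} \mathbb{E}_{G_i}\big[ J_k(\max\{b, R_i\}, \mathcal{G} \setminus \{G_i\}) \big],\; \tau + \mathbb{E}_A\big[ J_{k+1}(b, \mathcal{G} \cup \{F_{L_{k+1}}\}) \big] \Big\}. \qquad (19)$$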

A. Discussion on the Last Stage N 

Consider the scenario where the decision process enters the last stage N. Given the best reward value b, among 
the relays that have been probed, and the set Q of reward distributions of the unprobed relays, the source has to 
decide whether to stop or probe a relay (note that there is no continue action available at the last stage). This 
decision problem is similar to the one studied by Chaporkar and Proutiere in [13] in the context of channel selection,
which we briefly describe now. Given a set of channels with different channel gain distributions, a transmitter has 
to choose a channel for its transmissions. The transmitter can probe a channel to know its channel gain. Probing all 
the channels will enable the transmitter to select the best channel but at the cost of reduced effective transmission 
time within the coherence period. On the other hand, probing only a few channels may deprive the transmitter of 
the opportunity to transmit on a better channel. The transmitter is interested in maximizing its throughput within 
the coherence period, which is analogous to the cost (a combination of delay and effective reward) in our
case, which we are trying to minimize.

The authors in [13], for their channel probing problem, prove that the one-step-look-ahead (OSLA) rule is optimal.
Thus, given the channel gain of the best channel (among the channels probed so far) and a collection of channel
gain distributions of the unprobed channels, it is optimal to stop and transmit on the best channel if and only if the
throughput obtained by doing so is greater than the average throughput obtained by probing any unprobed channel
and then stopping (i.e., transmitting on the new-best channel). Further they prove that if the set of channel gain
distributions is totally stochastically ordered (see Assumption 1), then it is optimal to probe the channel whose
distribution is stochastically largest among all the unprobed channels. Applying the result of [13] to our model we
can conclude that OSLA is optimal once the decision process enters the last stage $N$. Thus given a state, $(b, \mathcal{G})$, at
stage $N$ it is optimal to stop if the cost of stopping is less than the cost of probing any distribution from $\mathcal{G}$ and
then stopping. Otherwise it is optimal to probe the stochastically largest distribution from $\mathcal{G}$.
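
The last-stage rule described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from [13] or from the authors: rewards live on a finite grid, each unprobed relay is represented by a pmf on that grid, and the family is assumed totally stochastically ordered so that the stochastically largest member is also the one with the largest mean. The function names are hypothetical.

```python
def expected_max(b, pmf, grid):
    """E[max{b, R}] for a reward R with the given pmf on grid."""
    return sum(p * max(b, r) for r, p in zip(grid, pmf))


def osla_last_stage(b, unprobed, grid, eta, delta):
    """One-step-look-ahead decision at the last stage N.

    Returns ("stop", None) or ("probe", i), where i indexes `unprobed`.
    """
    if not unprobed:
        return ("stop", None)
    stop_cost = -eta * b
    # Cost of probing one relay and then stopping, for each unprobed pmf.
    probe_then_stop = [eta * delta - eta * expected_max(b, pmf, grid)
                       for pmf in unprobed]
    if stop_cost <= min(probe_then_stop):
        return ("stop", None)
    # Assumption: under total stochastic ordering, the stochastically
    # largest distribution is the one with the largest mean.
    means = [sum(p * r for r, p in zip(grid, pmf)) for pmf in unprobed]
    return ("probe", max(range(len(unprobed)), key=means.__getitem__))
```

For example, with `grid = [0.0, 0.5, 1.0]`, two unprobed pmfs `[0.5, 0.5, 0.0]` and `[0.0, 0.5, 0.5]`, $\eta = 1$ and $\delta = 0.1$, the rule stops at $b = 1$ and probes the second relay at $b = 0$.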

B. Discussion on Stages $k = N-1, N-2, \dots, 1$

For the other stages $k = N-1, N-2, \dots, 1$, one can begin by defining the stopping sets $\mathcal{S}_k$, $\mathcal{S}_k^{\mathcal{G}}$ and the
stopping/probing set $\mathcal{Q}_k^{\mathcal{G}}$ analogous to the ones in (12), (13) and (14). Note that this time we need to define
$\mathcal{S}_k^{\mathcal{G}}$ and $\mathcal{Q}_k^{\mathcal{G}}$ for a set of distributions $\mathcal{G}$, unlike in the earlier case where we had defined these sets only for a given
distribution $F_\ell$. We conjecture that it is possible to show results analogous to the ones in Section IV, namely
Theorems 2 and 3, where we prove that the stopping sets are identical for every stage $k$. Formal analysis is still in
progress at the time of writing this paper. However, in the numerical results section (Section VI), while performing
value iteration we observed that our conjecture is true, at least for the example considered for the numerical work.

Finally, from the discussion in the previous subsection we know that if, at stage $N$, it is optimal to probe
when the state is $(b, \mathcal{G})$, then it is best to probe the stochastically largest distribution from $\mathcal{G}$. We also conjecture
that such a result will hold for every stage $k$, which is true for the numerical example considered in Section VI.

VI. Numerical Results 

The optimal policy within the restricted class (Sections III and IV) is allowed to keep only one unprobed relay
awake in addition to the best probed relay, while within the complete class (Section V), the optimal policy can keep
all the unprobed relays awake. We will refer to the former policy as RST-OPT (to be read as ReSTricted-OPTimal)
and the latter as simply GLB-OPT (read as GLoBal-OPTimal). In this section we will compare the performance
of RST-OPT against GLB-OPT. First we will briefly describe the relay selection example considered for the
numerical work.

Recall the local forwarding problem described in Section II. The source and sink are separated by a distance
of $v_0 = 10$ units (see Fig. 1). The radius of the communication region is set to 1 unit. There are $N = 5$ relays
within the forwarding region $\mathcal{L}$. These are uniformly located within $\mathcal{L}$. To enable us to perform value iteration (i.e.,
recursively solve the Bellman equation to obtain the optimal value and the optimal policy), we discretize the forwarding
region $\mathcal{L}$ by considering a set of 20 equally spaced points within $\mathcal{L}$ and then rounding the location of each relay to
the respective closest point. Recall the reward expression from (1). We have fixed $\Gamma N_0 = 1$, $\beta = 2$ and $a = 0.5$.
We finally normalize the reward to take values within the interval [0, 1] and then quantize it to one of the 100
equally spaced points within [0, 1]. The inter-wake-up times $\{U_k\}$ are exponential with mean $\tau = 0.2$.
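
The reward quantization and the wake-up process of this setup can be sketched as follows. This is a minimal illustration under assumptions that the text does not fix: the 100 quantization points are taken to include both endpoints of [0, 1], and the helper names `quantize` and `sample_wake_up_instants` are hypothetical.

```python
import random


def quantize(reward, levels=100):
    """Round a normalized reward in [0, 1] to the nearest of `levels`
    equally spaced points (endpoints included, by assumption)."""
    step = 1.0 / (levels - 1)
    return round(reward / step) * step


def sample_wake_up_instants(n_relays=5, mean=0.2, seed=None):
    """Wake-up instants obtained by accumulating exponential
    inter-wake-up times U_k with the given mean."""
    rng = random.Random(seed)
    t, instants = 0.0, []
    for _ in range(n_relays):
        t += rng.expovariate(1.0 / mean)  # rate = 1 / mean
        instants.append(t)
    return instants
```

With these defaults, the first wake-up instant has mean 0.2, which is the value the average delay approaches when the source always forwards to the first relay.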
Fig. 3. Average total cost vs. the Lagrange multiplier $\eta$.



In Fig. 3 we plot the average total cost incurred by RST-OPT and GLB-OPT as a function of the
Lagrange multiplier $\eta$ for two different values of the probing cost $\delta$ (namely $\delta = 0.1$ and $\delta = 0.01$). The total cost
is decreasing with $\eta$. First, observe that the total cost incurred, by either policy, for $\delta = 0.01$ is smaller than for
$\delta = 0.1$. This is because, when the probing cost is smaller, each policy will end up probing more relays,
thus accruing a larger reward and yielding a lower cost. Next, since GLB-OPT is optimal over a larger class of
policies, we know that, for a given $\delta$, GLB-OPT should incur a smaller cost than RST-OPT. However, interestingly,
from the plot we observe that the difference between the two costs is small. Also, from the figure, note that the
cost difference for $\delta = 0.01$ is smaller than that for $\delta = 0.1$. This is because, as the probing cost is decreased, both
policies start behaving identically by probing most of the relays until they stop. Finally, when there is no
cost for probing, i.e., for $\delta = 0$, we expect both policies to be identical.
Fig. 4. Individual components of the total cost in Fig. 3 as functions of $\eta$: (a) Avg. Delay, (b) Avg. Reward and (c) Avg. Probing Cost.



In Fig. 4(a), 4(b) and 4(c) we plot the individual components (namely delay, reward and probing cost, respectively)
of the total cost, as a function of $\eta$. All the curves in these figures are increasing with $\eta$, except for $\delta = 0.1$ where
the average reward and probing cost of RST-OPT show a small decrease when $\eta$ varies from 40 to 60. As $\eta$
decreases we observe, from Fig. 4(a), that all the average delay curves converge to 0.2 (recall that $\tau = 0.2$ is the
average time until the first relay wakes up). This is because, for small values of $\eta$, the source values the delay
more and hence always forwards to the relay that wakes up first, irrespective of its reward. For the
same reason the average probing cost curves in Fig. 4(c) converge to their respective $\delta$ value, which is the cost for
probing a single relay. Similarly the average reward in Fig. 4(b) converges to the average reward value of the first
relay.

Finally, we comment on the computational complexity of the two policies. To obtain the policy GLB-OPT we had to recursively
solve the Bellman equations in (18) and (19), for every stage $k$ and every possible state at stage $k$, starting from
the last stage $N$, see (17). The total number of all possible states at stage $k$, i.e., the cardinality of the state space
$SS_k$ in (16), grows exponentially with the cardinality of $\mathcal{F}$ (assuming that $\mathcal{F}$ is discrete, as in our numerical
example). It also grows exponentially with the stage index $k$. In contrast, for computing RST-OPT, since
within the restricted class only one unprobed relay is kept awake at any time, the state space size grows linearly
with the cardinality of $\mathcal{F}$. Further, the size of the state space does not grow with the number of relays $N$. From
our analysis in Section IV we know that the stopping sets are threshold based and moreover the thresholds ($x_k$ and
$\{x_k^\ell : F_\ell \in \mathcal{F}\}$) are stage independent, so that these thresholds have to be computed only once (namely for stage
$N-1$), further reducing the complexity of RST-OPT.

RST-OPT can also be regarded as energy efficient in the sense that it keeps only one unprobed relay (in addition
to the best probed one) awake while instructing the other unprobed ones to switch to a low power OFF state (i.e.,
sleep state). GLB-OPT, in contrast, operates by keeping awake all the relays that have woken up thus far. The close
to optimal performance of RST-OPT, its computational simplicity and its energy efficiency motivate us to apply
RST-OPT at each hop en route to the sink in a large sensor network and study its end-to-end performance, which
is a part of our ongoing work.

VII. Conclusion and Future Scope 

We considered the sequential decision problem of choosing a relay node for the next hop, by a forwarding node,
in the presence of random sleep-wake cycling. In the model, we have incorporated the energy cost of probing a
relay to learn its channel gain. We have analysed a restricted class of policies where any policy is allowed to keep
at most one unprobed relay awake, in addition to the best probed relay. The optimal policy for the restricted class
(RST-OPT) was characterized in terms of the stopping sets and the stopping/probing sets. First, we showed that the
stopping sets are of threshold type. Further, we proved that the thresholds characterizing the stopping sets are
stage independent, thus simplifying the computation of RST-OPT in comparison with the global optimal, GLB-OPT
(whose decisions at any stage are based on the entire history). Numerical work confirmed that the performance of
RST-OPT is close to that of GLB-OPT.

As mentioned earlier, a part of our ongoing work is to study the end-to-end performance (i.e., total delay and 
overall power consumption) of RST-OPT for a given network. Another interesting direction is to apply cooperative 
techniques for forwarding in sleep-wake cycling networks and study its benefits. 
VIII. APPENDICES

A. Motivation for the Particular Reward Structure 

In this section we will motivate the reason for considering the particular reward structure (namely a weighted
ratio of progress and power, see (1)). We were interested in minimizing end-to-end delay subject to a constraint
on the total transmission power. We considered solving the local forwarding problem for the proposed reward in
(1). However we assumed a deterministic channel model (i.e., no fading or equivalently $|H_\ell|^2 = 1$ in (1)) so that
the exact reward value is revealed by the relays upon waking up. Therefore our earlier work H can be applied 
to obtain a simple threshold (on reward) based rule. We have applied this policy at each hop towards the sink
in a network comprising 500 nodes that are randomly sleep-wake cycling. We observed that, for $a = 0.5$, the
performance of our policy (in terms of total average delay and average transmission power) is comparable to the
performance of the LOCAL-OPT algorithm (a Bellman-Ford based algorithm) proposed by Kim et al. [23]. The
performance tradeoff curves are shown in Fig. 5. Though these observations were for the model with deterministic
channels, we want to consider the same reward structure in the present work as well, where the channel gains are
random and hence the source has to probe a relay to learn its reward.
Fig. 5. Curves showing the tradeoff between the average end-to-end delay and the total power for various policies.



B. Proof of Lemma 2

First, let us make a note of the following lemma.

Lemma 8: If $F_\ell \ge_{st} F_u$, then for any $b$ we have $F_{\max\{b, R_\ell\}} \ge_{st} F_{\max\{b, R_u\}}$, where the distributions of $R_\ell$ and
$R_u$ are $F_\ell$ and $F_u$, respectively.

Proof: The proof easily follows by noting that

$$F_{\max\{b, R_\ell\}}(x) = \begin{cases} 0 & \text{if } x < b, \\ F_\ell(x) & \text{otherwise.} \end{cases}$$
■
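
Lemma 8 is easy to check numerically on a discrete grid. The sketch below is only an illustration: it assumes the usual CDF characterisation of stochastic ordering ($F_\ell \ge_{st} F_u$ when $F_\ell(x) \le F_u(x)$ for every $x$), represents each CDF by its values on a common grid, and uses hypothetical helper names.

```python
def cdf_of_max(b, grid, cdf):
    """CDF of max{b, R} on grid: 0 for x < b, F(x) otherwise."""
    return [0.0 if x < b else fx for x, fx in zip(grid, cdf)]


def dominates(cdf_l, cdf_u, tol=1e-12):
    """True if F_l >=_st F_u, i.e. F_l(x) <= F_u(x) at every grid point."""
    return all(fl <= fu + tol for fl, fu in zip(cdf_l, cdf_u))


def ordering_preserved(grid, cdf_l, cdf_u):
    """Check that F_l >=_st F_u carries over to max{b, R} for every b on grid."""
    return all(dominates(cdf_of_max(b, grid, cdf_l), cdf_of_max(b, grid, cdf_u))
               for b in grid)
```

Truncating both CDFs to zero below the same point $b$ cannot reverse a pointwise inequality between them, which is all the proof uses.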
Proof of Lemma 2-(i): Proof is by induction. For stage $N$ we know that $J_N(b) = -\eta b$, and hence is decreasing
in $b$. Then, from its definition it is clear that $J_N(b, F_\ell)$ is also decreasing in $b$. Thus, the monotonicity properties hold for
$k = N$. Suppose $J_{k+1}(b)$ and $J_{k+1}(b, F_\ell)$ are decreasing in $b$ for some $k + 1 = 2, 3, \dots, N$. We will show that
the result holds for $k$ as well.

Recall the expressions of $J_k(b)$ and $J_k(b, F_\ell)$ ((10) and (11), respectively): $J_k(b) = \min\{-\eta b, c^c_k(b)\}$ and
$J_k(b, F_\ell) = \min\{-\eta b, c^p_k(b, F_\ell), c^c_k(b, F_\ell)\}$. Thus to complete the proof it is sufficient to show that $c^c_k(b)$,
$c^p_k(b, F_\ell)$ and $c^c_k(b, F_\ell)$ are decreasing in $b$.

From the induction hypothesis, it is easy to see that $c^c_k(b)$ is decreasing in $b$, using which we can
conclude that $J_k(b)$ (see (10)) is decreasing in $b$. Now that we have established that $J_k(b)$ is decreasing in $b$, it
immediately follows that the probing cost $c^p_k(b, F_\ell)$ is also decreasing in $b$.

Again using the induction argument, observe that $\min\{J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}})\}$ is decreasing in $b$ so that
the continuing cost $c^c_k(b, F_\ell)$ is also decreasing. ■

Proof of Lemma 2-(ii): Consider stage $N$ and recall the optimal cost-to-go function $J_N(b, F_\ell)$,

$$J_N(b, F_\ell) = \min\Big\{ -\eta b,\; \eta\delta - \eta\, \mathbb{E}_\ell\big[ \max\{b, R_\ell\} \big] \Big\}.$$

Using the definition of stochastic ordering (Definition 1) and Lemma 8 we can write

$$\mathbb{E}_\ell\big[ \max\{b, R_\ell\} \big] \ge \mathbb{E}_u\big[ \max\{b, R_u\} \big],$$

so that the result holds for stage $N$. Suppose the result holds for some $k + 1 = 2, 3, \dots, N$. We will show
that the cost of probing and the cost of continuing both satisfy $c^p_k(b, F_\ell) \le c^p_k(b, F_u)$ and
$c^c_k(b, F_\ell) \le c^c_k(b, F_u)$. Then the result easily follows for stage $k$ by recalling (from (11)) that $J_k(b, F_\ell) =
\min\{-\eta b, c^p_k(b, F_\ell), c^c_k(b, F_\ell)\}$.

From Lemma 2-(i) we know that $J_k(b)$ is decreasing in $b$. Again using Definition 1 and Lemma 8 we can
conclude that

$$\mathbb{E}_\ell\big[ J_k(\max\{b, R_\ell\}) \big] \le \mathbb{E}_u\big[ J_k(\max\{b, R_u\}) \big],$$

so that $c^p_k(b, F_\ell) \le c^p_k(b, F_u)$.

From the induction argument we know that $J_{k+1}(b, F_\ell) \le J_{k+1}(b, F_u)$ so that

$$\min\big\{ J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}}) \big\} \le \min\big\{ J_{k+1}(b, F_u), J_{k+1}(b, F_{L_{k+1}}) \big\}.$$

Therefore $c^c_k(b, F_\ell) \le c^c_k(b, F_u)$. ■

Proof of Lemma 2-(iii): This result is very intuitive, since with more stages to go, one is expected
to accrue a smaller average cost. However, we prove it here for completeness.

Again the proof is by induction. For stage $N-1$ we know that $J_{N-1}(b) = \min\{-\eta b, c^c_{N-1}(b)\}$. Replacing $-\eta b$
with $J_N(b)$ in the previous expression we obtain $J_{N-1}(b) \le J_N(b)$. Next, consider a state of the form $(b, F_\ell)$. The
cost of probing $c^p_{N-1}(b, F_\ell)$ can be bounded as follows,

$$\begin{aligned}
c^p_{N-1}(b, F_\ell) &= \eta\delta + \mathbb{E}_\ell\big[ J_{N-1}(\max\{b, R_\ell\}) \big] \\
&\le \eta\delta + \mathbb{E}_\ell\big[ J_N(\max\{b, R_\ell\}) \big] \\
&= \eta\delta - \eta\, \mathbb{E}_\ell\big[ \max\{b, R_\ell\} \big],
\end{aligned}$$

where, in the inequality, we have used $J_{N-1}(b) \le J_N(b)$, which we have already shown. Therefore,

$$\begin{aligned}
J_{N-1}(b, F_\ell) &= \min\big\{ -\eta b,\; c^p_{N-1}(b, F_\ell),\; c^c_{N-1}(b, F_\ell) \big\} \\
&\le \min\big\{ -\eta b,\; c^p_{N-1}(b, F_\ell) \big\} \\
&\le \min\big\{ -\eta b,\; \eta\delta - \eta\, \mathbb{E}_\ell\big[ \max\{b, R_\ell\} \big] \big\} \\
&= J_N(b, F_\ell).
\end{aligned}$$

Thus we have shown the result for stage $N-1$. Suppose the result holds for some stage $k + 1 = 2, 3, \dots, N-1$,
i.e., $J_{k+1}(b) \le J_{k+2}(b)$ and $J_{k+1}(b, F_\ell) \le J_{k+2}(b, F_\ell)$. Using the induction hypothesis, the cost of continuing
$c^c_k(b)$ can be bounded as

$$\begin{aligned}
c^c_k(b) &= \tau + \mathbb{E}_A\big[ J_{k+1}(b, F_{L_{k+1}}) \big] \\
&\le \tau + \mathbb{E}_A\big[ J_{k+2}(b, F_{L_{k+2}}) \big] \\
&= c^c_{k+1}(b).
\end{aligned}$$

Then $J_k(b) \le J_{k+1}(b)$ (see (10)). Recall the expression of $J_k(b, F_\ell)$ from (11). To show that $J_k(b, F_\ell) \le
J_{k+1}(b, F_\ell)$, it is sufficient to show that $c^p_k(b, F_\ell) \le c^p_{k+1}(b, F_\ell)$ and $c^c_k(b, F_\ell) \le c^c_{k+1}(b, F_\ell)$. First consider
the probing cost,

$$\begin{aligned}
c^p_k(b, F_\ell) &= \eta\delta + \mathbb{E}_\ell\big[ J_k(\max\{b, R_\ell\}) \big] \\
&\le \eta\delta + \mathbb{E}_\ell\big[ J_{k+1}(\max\{b, R_\ell\}) \big] \\
&= c^p_{k+1}(b, F_\ell),
\end{aligned}$$

where, in the inequality, we have used $J_k(b) \le J_{k+1}(b)$, which we have already shown. Finally, consider
the cost of continuing,

$$\begin{aligned}
c^c_k(b, F_\ell) &= \tau + \mathbb{E}_A\big[ \min\{ J_{k+1}(b, F_\ell), J_{k+1}(b, F_{L_{k+1}}) \} \big] \\
&\le \tau + \mathbb{E}_A\big[ \min\{ J_{k+2}(b, F_\ell), J_{k+2}(b, F_{L_{k+2}}) \} \big] \\
&= c^c_{k+1}(b, F_\ell),
\end{aligned}$$

where the inequality is obtained by simply using the induction argument. ■
C. Proof of Lemma 5

The following property of the min operator will be useful while proving Lemma 5.

Lemma 9: If $x_1, x_2, \dots, x_j$ and $y_1, y_2, \dots, y_j$ in $\Re$ are such that $x_i - y_i \le x_1 - y_1$ for all $i = 1, 2, \dots, j$, then

$$\min\{x_1, x_2, \dots, x_j\} - \min\{y_1, y_2, \dots, y_j\} \le x_1 - y_1. \qquad (20)$$

Proof: Suppose $\min\{y_1, y_2, \dots, y_j\} = y_i$, for some $1 \le i \le j$, then the LHS of (20) can be written as,

$$\text{LHS} = \min\{x_1, x_2, \dots, x_j\} - y_i \le x_i - y_i.$$

The proof is complete by recalling that we are given $x_i - y_i \le x_1 - y_1$. ■
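
The inequality in Lemma 9 can be sanity-checked on random instances. The following Python sketch is purely illustrative: the helper names and the sampling ranges are arbitrary choices, and the instances are generated so that the hypothesis $x_i - y_i \le x_1 - y_1$ holds by construction.

```python
import random


def lemma9_gap(xs, ys):
    """Return (min(xs) - min(ys), xs[0] - ys[0]), the two sides of (20)."""
    return min(xs) - min(ys), xs[0] - ys[0]


def random_instance(rng, j):
    """Draw xs, ys of length j with x_i - y_i <= x_1 - y_1 for every i."""
    ys = [rng.uniform(-5.0, 5.0) for _ in range(j)]
    d1 = rng.uniform(-2.0, 2.0)  # the difference x_1 - y_1
    diffs = [d1] + [d1 - rng.uniform(0.0, 3.0) for _ in range(j - 1)]
    xs = [y + d for y, d in zip(ys, diffs)]
    return xs, ys


def check_lemma9(trials=1000, seed=0, tol=1e-9):
    """True if the inequality (20) holds on all sampled instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs, ys = random_instance(rng, rng.randint(1, 6))
        lhs, rhs = lemma9_gap(xs, ys)
        if lhs > rhs + tol:
            return False
    return True
```

In the proofs that follow, the lemma is applied with $x_1 = -\eta b_1$ and $y_1 = -\eta b_2$, so that the bound $x_1 - y_1$ equals $\eta(b_2 - b_1)$.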

Proof of Lemma 5: Since $J_N(b)$ is simply $-\eta b$ we have, for stage $N$, $J_N(b_1) - J_N(b_2) = \eta(b_2 - b_1)$. Also,
for a given distribution $F_\ell$ and for $b_2 > b_1$,

$$\begin{aligned}
c^p_N(b_1, F_\ell) - c^p_N(b_2, F_\ell) &= \eta\, \mathbb{E}_\ell\big[ \max\{b_2, R_\ell\} - \max\{b_1, R_\ell\} \big] \\
&\le \eta(b_2 - b_1),
\end{aligned}$$

which, along with Lemma 9, implies that $J_N(b_1, F_\ell) - J_N(b_2, F_\ell) \le \eta(b_2 - b_1)$. Suppose for some stage $k + 1 =
2, 3, \dots, N$ we have $J_{k+1}(b_1) - J_{k+1}(b_2) \le \eta(b_2 - b_1)$ and $J_{k+1}(b_1, F_\ell) - J_{k+1}(b_2, F_\ell) \le \eta(b_2 - b_1)$. Then we
will show that all the inequalities listed in the lemma hold for stage $k$ as well. First, a simple application of
the induction hypothesis yields

$$\begin{aligned}
c^c_k(b_1) - c^c_k(b_2) &= \mathbb{E}_A\big[ J_{k+1}(b_1, F_{L_{k+1}}) - J_{k+1}(b_2, F_{L_{k+1}}) \big] \\
&\le \eta(b_2 - b_1),
\end{aligned}$$

which in turn (along with Lemma 9) implies $J_k(b_1) - J_k(b_2) \le \eta(b_2 - b_1)$, using which we can write

$$\begin{aligned}
c^p_k(b_1, F_\ell) - c^p_k(b_2, F_\ell) &= \mathbb{E}_\ell\big[ J_k(\max\{b_1, R_\ell\}) - J_k(\max\{b_2, R_\ell\}) \big] \\
&\le \mathbb{E}_\ell\big[ \eta(\max\{b_2, R_\ell\} - \max\{b_1, R_\ell\}) \big] \\
&\le \eta(b_2 - b_1).
\end{aligned}$$

Define $\mathcal{L}_\ell$ as the set of all distributions that are stochastically greater than $F_\ell$, i.e., $\mathcal{L}_\ell := \{F_t \in \mathcal{F} : F_t \ge_{st} F_\ell\}$.
Let $\mathcal{L}_\ell^c := \mathcal{F} \setminus \mathcal{L}_\ell$. From Assumption 1 (that $\mathcal{F}$ is totally stochastically ordered), it follows that $\mathcal{L}_\ell^c$ contains all
distributions in $\mathcal{F}$ which are stochastically smaller than $F_\ell$. Then, using Lemma 2-(ii) the cost of continuing can be
simplified as,

$$\begin{aligned}
c^c_k(b_1, F_\ell) - c^c_k(b_2, F_\ell) = {} & \int_{\mathcal{L}_\ell} \big( J_{k+1}(b_1, F_t) - J_{k+1}(b_2, F_t) \big)\, dA(t) \\
& + \int_{\mathcal{L}_\ell^c} \big( J_{k+1}(b_1, F_\ell) - J_{k+1}(b_2, F_\ell) \big)\, dA(t).
\end{aligned}$$

Now, a straightforward application of the induction argument gives the desired result. Finally, the induction argument
is complete by using Lemma 9 to conclude that $J_k(b_1, F_\ell) - J_k(b_2, F_\ell) \le \eta(b_2 - b_1)$. ■
D. Proof of Lemma 7

As discussed in the outline of the proof of Lemma 7, the result immediately follows once we prove Step 1 and
Step 2.

Lemma 10: Suppose $F_u$ is a distribution such that for all $k = 1, 2, \dots, N-1$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^u$. Then for any distribution
$F_\ell \ge_{st} F_u$ we have $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$.

Proof: We will first show that $\mathcal{S}_{N-1} \subseteq \mathcal{Q}_{N-1}^\ell$. Fix a $b \in \mathcal{S}_{N-1}$. Then $b \in \mathcal{Q}_{N-1}^u$ (because it is given that
$\mathcal{S}_{N-1} \subseteq \mathcal{Q}_{N-1}^u$), so that using the definition of the set $\mathcal{Q}_{N-1}^u$ (from (14)) we can write

$$\min\Big\{ -\eta b,\; \eta\delta + \mathbb{E}_u\big[ J_{N-1}(\max\{b, R_u\}) \big] \Big\} \le c^c_{N-1}(b, F_u). \qquad (21)$$

Since $b \in \mathcal{S}_{N-1}$, for any distribution $F_s$, the minimum of the cost of stopping and the cost of probing can be
simplified as

$$\begin{aligned}
\min\Big\{ -\eta b,\; \eta\delta + \mathbb{E}_s\big[ J_{N-1}(\max\{b, R_s\}) \big] \Big\} &= \min\Big\{ -\eta b,\; \eta\delta - \eta\, \mathbb{E}_s\big[ \max\{b, R_s\} \big] \Big\} \\
&= J_N(b, F_s), \qquad (22)
\end{aligned}$$

where, in the first equality we have replaced $J_{N-1}(\max\{b, R_s\})$ by the cost of stopping ($-\eta \max\{b, R_s\}$). This
is because, after probing we are still at stage $N-1$ with the new state being $\max\{b, R_s\}$, which is also in $\mathcal{S}_{N-1}$
(Lemma 1), and in $\mathcal{S}_{N-1}$ we know that it is optimal to stop, so that $J_{N-1}(\max\{b, R_s\}) = -\eta \max\{b, R_s\}$.

Thus we are given $J_N(b, F_u) \le c^c_{N-1}(b, F_u)$ (use (22) in (21)). Also from Lemma 2-(ii), since we are given
$F_\ell \ge_{st} F_u$, we have $J_N(b, F_\ell) \le J_N(b, F_u)$. Combining these we can write

$$J_N(b, F_\ell) \le J_N(b, F_u) \le c^c_{N-1}(b, F_u). \qquad (23)$$

By recalling (22) and using (21), we need to show $J_N(b, F_\ell) \le c^c_{N-1}(b, F_\ell)$ to conclude that $b \in \mathcal{Q}_{N-1}^\ell$.

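Now for any distribution $F_s \in \mathcal{F}$ define $\mathcal{L}_s := \{t \in \mathcal{L} : F_t \ge_{st} F_s\}$, i.e., $\mathcal{L}_s$ is the set of all distributions in $\mathcal{F}$
that are stochastically greater than $F_s$. Let $\mathcal{L}_s^c = \mathcal{L} \setminus \mathcal{L}_s$. Because of the condition imposed on $\mathcal{F}$ that it is totally
stochastically ordered (Assumption 1), $\mathcal{L}_s^c$ contains all distributions in $\mathcal{F}$ that are stochastically smaller than $F_s$.
Further, for $F_\ell \ge_{st} F_u$ we have $\mathcal{L}_\ell \subseteq \mathcal{L}_u$. Then recalling the expression for $c^c_{N-1}(b, F_u)$ we can write

$$\begin{aligned}
c^c_{N-1}(b, F_u) &= \tau + \mathbb{E}_A\big[ \min\{ J_N(b, F_u), J_N(b, F_{L_N}) \} \big] \\
&= \tau + \int_{\mathcal{L}_u} J_N(b, F_t)\, dA(t) + \int_{\mathcal{L}_u^c} J_N(b, F_u)\, dA(t) \\
&= \tau + \int_{\mathcal{L}_\ell} J_N(b, F_t)\, dA(t) + \int_{\mathcal{L}_u \setminus \mathcal{L}_\ell} J_N(b, F_t)\, dA(t) + \int_{\mathcal{L}_u^c} J_N(b, F_u)\, dA(t),
\end{aligned}$$

where to obtain the second equality we have used Lemma 2-(ii) and the definition of $\mathcal{L}_u$. For any $F_t \in \mathcal{L}_u \setminus \mathcal{L}_\ell$ we know
that $F_t \ge_{st} F_u$ so that $J_N(b, F_t) \le J_N(b, F_u)$ (again from Lemma 2-(ii)). Thus replacing $J_N(b, F_t)$ by $J_N(b, F_u)$
in the middle integral above we obtain

$$c^c_{N-1}(b, F_u) \le \tau + \int_{\mathcal{L}_\ell} J_N(b, F_t)\, dA(t) + \Big( \int_{\mathcal{L}_\ell^c} dA(t) \Big) J_N(b, F_u). \qquad (24)$$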
From (23) and (24) we see that we have an inequality of the following form

$$J_N(b, F_\ell) \le J_N(b, F_u) \le c + p\, J_N(b, F_u),$$

where $c = \tau + \int_{\mathcal{L}_\ell} J_N(b, F_t)\, dA(t)$ and $p = \int_{\mathcal{L}_\ell^c} dA(t)$. Since $p \in [0, 1]$ we can write

$$J_N(b, F_\ell)(1 - p) \le J_N(b, F_u)(1 - p),$$

rearranging which, we obtain

$$\begin{aligned}
J_N(b, F_\ell) &\le p\, J_N(b, F_\ell) + J_N(b, F_u) - p\, J_N(b, F_u) \\
&\le p\, J_N(b, F_\ell) + c + p\, J_N(b, F_u) - p\, J_N(b, F_u) \\
&= c + p\, J_N(b, F_\ell).
\end{aligned}$$

The proof for stage $N-1$ is complete by noting that $c + p\, J_N(b, F_\ell)$ is $c^c_{N-1}(b, F_\ell)$.

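Suppose that for some $k + 1 = 2, 3, \dots, N-1$ we have $\mathcal{S}_{k+1} \subseteq \mathcal{Q}_{k+1}^\ell$. We will have to show that the same
holds for stage $k$. Fix any $b \in \mathcal{S}_k$, then for any distribution $F_s$, exactly as in (22),

$$\begin{aligned}
\min\Big\{ -\eta b,\; \eta\delta + \mathbb{E}_s\big[ J_k(\max\{b, R_s\}) \big] \Big\} &= \min\Big\{ -\eta b,\; \eta\delta - \eta\, \mathbb{E}_s\big[ \max\{b, R_s\} \big] \Big\} \\
&= J_N(b, F_s). \qquad (25)
\end{aligned}$$

Thus the hypothesis $\mathcal{S}_k \subseteq \mathcal{Q}_k^u$ implies $J_N(b, F_u) \le c^c_k(b, F_u)$ and we need to show $\mathcal{S}_k \subseteq \mathcal{Q}_k^\ell$, i.e., $J_N(b, F_\ell) \le
c^c_k(b, F_\ell)$. Proceeding as before (see (24)) we obtain

$$c^c_k(b, F_u) \le \tau + \int_{\mathcal{L}_\ell} J_{k+1}(b, F_t)\, dA(t) + \Big( \int_{\mathcal{L}_\ell^c} dA(t) \Big) J_{k+1}(b, F_u).$$

Now using Lemma 6 we conclude

$$c^c_k(b, F_u) \le \tau + \int_{\mathcal{L}_\ell} J_{k+1}(b, F_t)\, dA(t) + \Big( \int_{\mathcal{L}_\ell^c} dA(t) \Big) J_N(b, F_u).$$

Note that the conditions required to apply Lemma 6 hold, i.e., $b \in \mathcal{S}_{k+1}$ (since $\mathcal{S}_k \subseteq \mathcal{S}_{k+1}$ from Lemma 4-(iv))
and $\mathcal{S}_{k+1} \subseteq \mathcal{Q}_{k+1}^u$ (this is given).

Thus, again we have an inequality of the form $J_N(b, F_\ell) \le J_N(b, F_u) \le c' + p\, J_N(b, F_u)$ (where $c' = \tau +
\int_{\mathcal{L}_\ell} J_{k+1}(b, F_t)\, dA(t)$). As before we can show that $J_N(b, F_\ell) \le c' + p\, J_N(b, F_\ell)$. Finally the proof is complete by
showing that $c' + p\, J_N(b, F_\ell) = c^c_k(b, F_\ell)$ as follows,

$$\begin{aligned}
c^c_k(b, F_\ell) &= \tau + \int_{\mathcal{L}_\ell} J_{k+1}(b, F_t)\, dA(t) + \int_{\mathcal{L}_\ell^c} J_{k+1}(b, F_\ell)\, dA(t) \\
&= c' + p\, J_N(b, F_\ell), \qquad (26)
\end{aligned}$$

where to replace $J_{k+1}(b, F_\ell)$ by $J_N(b, F_\ell)$ we have to again apply Lemma 6. However this time $\mathcal{S}_{k+1} \subseteq \mathcal{Q}_{k+1}^\ell$ is
by the induction hypothesis. ■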

To use the above lemma we still require a distribution $F_u$ satisfying $\mathcal{S}_k \subseteq \mathcal{Q}_k^u$, for every $k$. The minimal
distribution $F_m$ turns out to be useful in this context.

Lemma 11: For every $k = 1, 2, \dots, N-1$, $\mathcal{S}_k \subseteq \mathcal{Q}_k^m$.

Proof: Since $F_m$ is a minimal distribution (see Assumption 1), from Lemma 2-(ii) we know that

$$J_{k+1}(b, F_{L_{k+1}}) \le J_{k+1}(b, F_m).$$

Using the above expression in the definition of $c^c_k(b, F_m)$ and then recalling the definition of $c^c_k(b)$, we can write $c^c_k(b, F_m) = c^c_k(b)$. Finally, the result
follows from the definition of the sets $\mathcal{Q}_k^m$ and $\mathcal{S}_k$. ■

E. Proof of Theorem 3

Proof: Recalling the definition of the set $\mathcal{S}_k^\ell$ (from (13)), for any $b \in \mathcal{S}_{k+1}^\ell$ we have

$$-\eta b \le \min\big\{ c^p_{k+1}(b, F_\ell),\; c^c_{k+1}(b, F_\ell) \big\}.$$

Suppose, as in Theorem 2, we can show that for any $b \in \mathcal{S}_{k+1}^\ell$, the various costs at stages $k$ and $k + 1$ are the same, i.e.,
$c^p_k(b, F_\ell) = c^p_{k+1}(b, F_\ell)$ and $c^c_k(b, F_\ell) = c^c_{k+1}(b, F_\ell)$, then this would imply $\mathcal{S}_k^\ell \supseteq \mathcal{S}_{k+1}^\ell$. The proof is complete
by recalling that we already have $\mathcal{S}_k^\ell \subseteq \mathcal{S}_{k+1}^\ell$ (from Lemma 4-(iv)).

Fix a $b \in \mathcal{S}_{k+1}^\ell$. To show that $c^p_k(b, F_\ell) = c^p_{k+1}(b, F_\ell)$, first using Lemma 4-(ii) and Theorem 2 note that
$\mathcal{S}_{k+1}^\ell \subseteq \mathcal{S}_{k+1} = \mathcal{S}_k$. Since $b \in \mathcal{S}_{k+1}$ the cost of probing is

$$\begin{aligned}
c^p_{k+1}(b, F_\ell) &= \eta\delta + \mathbb{E}_\ell\big[ J_{k+1}(\max\{b, R_\ell\}) \big] \\
&= \eta\delta - \eta\, \mathbb{E}_\ell\big[ \max\{b, R_\ell\} \big],
\end{aligned}$$

where, to obtain the second equality, note that $\max\{b, R_\ell\} \in \mathcal{S}_{k+1}$ (Lemma 1) and hence at $\max\{b, R_\ell\}$ it is optimal
to stop, so that $J_{k+1}(\max\{b, R_\ell\}) = -\eta \max\{b, R_\ell\}$. Similarly, since $b$ is also in $\mathcal{S}_k$ the cost of probing at stage
$k$, $c^p_k(b, F_\ell)$, is again $\eta\delta - \eta\, \mathbb{E}_\ell[\max\{b, R_\ell\}]$.

Finally, following the same procedure used in Theorem 2 to show $c^c_k(b) = c^c_{k+1}(b)$, one can prove that

$$c^c_k(b, F_\ell) = c^c_{k+1}(b, F_\ell).$$ ■



